(54) Skin area detection for video image systems

(57) Apparatus for detecting skin areas in video sequences includes a shape locator and a tone detector
(both in 12). The shape locator analyzes the input video sequences (26) to identify the edges of all the
objects in a video frame and determine whether such edges approximate the outline of a predetermined shape
that is likely to contain a skin area. Once objects likely to contain skin areas are located by the shape
locator, the tone detector examines the picture elements (pixels) of each located object to determine if
such pixels have signal energies that are characteristic of skin areas. The tone detector then samples
pixels that have signal energies which are characteristic of skin areas to determine a range of skin tones
and compares the range of sampled skin tones with the tones in the entire frame to find all matching skin
tones. An eyes-nose-mouth (ENM) region detector is optionally incorporated between the shape locator and
the tone detector to identify the location of an ENM region on an object that is likely to be a face, so as
to improve the accuracy of the range of skin tones that are sampled by the tone detector.
Description

1. Field of the invention 

The present invention relates to a low bit-rate communication system for multimedia applications, such as a video
teleconferencing system, and more particularly, to a method of, and system for, identifying skin areas in video images.

2. Description of the Related Art 

The storage and transmission of full-color, full-motion images is increasingly in demand. These images are used
not only for entertainment, as in motion picture or television productions, but also for analytical and diagnostic tasks
such as engineering analysis and medical imaging.

There are several advantages to providing these images in digital form. For example, digital images are more
susceptible to enhancement and manipulation. Also, digital video images can be regenerated accurately over several
generations with only minimal signal degradation.

On the other hand, digital video requires significant memory capacity for storage and, equivalently, it requires a
high-bandwidth channel for transmission. For example, a single 512 by 512 pixel gray-scale image with 256 gray levels
requires more than 256,000 bytes of storage. A full color image requires nearly 800,000 bytes. Natural-looking motion
requires that images be updated at least 30 times per second. A transmission channel for natural-looking full color
moving images must therefore accommodate approximately 190 million bits per second. However, modern digital
communication applications, including videophones, set-top-boxes for video-on-demand and video teleconferencing
systems have transmission channels with bandwidth limitations, so that the number of bits available for transmitting video
image information is less than 190 million bits per second.
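The storage and bandwidth figures above can be checked with a short calculation. The sketch below assumes one byte per pixel for the 256-level gray-scale image and three bytes per pixel (8 bits each for red, green and blue) for the full color image; the function names are illustrative only.

```python
# Recompute the storage and bandwidth figures quoted in the text.
WIDTH, HEIGHT = 512, 512      # frame size in pixels (from the text)
FRAMES_PER_SECOND = 30        # minimum update rate for natural-looking motion


def frame_bytes(width, height, bytes_per_pixel):
    """Uncompressed size of one frame in bytes."""
    return width * height * bytes_per_pixel


def channel_bits_per_second(width, height, bytes_per_pixel, fps):
    """Uncompressed bit rate for a sequence of such frames."""
    return frame_bytes(width, height, bytes_per_pixel) * 8 * fps


gray_bytes = frame_bytes(WIDTH, HEIGHT, 1)     # 256 gray levels -> 1 byte per pixel
color_bytes = frame_bytes(WIDTH, HEIGHT, 3)    # assumed 3 bytes per pixel (R, G, B)
color_rate = channel_bits_per_second(WIDTH, HEIGHT, 3, FRAMES_PER_SECOND)

print(gray_bytes)    # 262144     ("more than 256,000 bytes")
print(color_bytes)   # 786432     ("nearly 800,000 bytes")
print(color_rate)    # 188743680  ("approximately 190 million bits per second")
```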

As a result, a number of image compression techniques such as, for example, discrete cosine transformation
(DCT) have been used to reduce the information capacity required for the storage and transmission of digital video
signals. These techniques generally take advantage of the considerable redundancy in any natural image so as to
reduce the amount of data used to transmit, record, and reproduce the digital video images. For example, if the video
image to be transmitted is an image of the sky on a clear day, the discrete cosine transform (DCT) image data information
has many zero data components since there is little or no variation in the objects depicted for such an image. Thus,
the image information of the sky on a clear day is compressed by transmitting only the small number of non-zero data
components.

A problem associated with image compression techniques, such as discrete cosine transformation (DCT), is
that they produce lossy images, since only partial image information is transmitted in order to reduce the bit rate. A
lossy image is a video image which contains distortions in the objects depicted, when the decoded image content is
compared with the original image content. Since most video teleconferencing or telephony applications are focused
toward images containing persons rather than scenery, the ability to transmit video images without distortions is
important. This is because a viewer will tend to focus his or her attention toward specific features (objects) contained in
the video sequences such as the faces, hands or other skin areas of the persons in the scene, instead of toward items
such as, for example, clothing and background scenery.
In some situations, a very good rendition of facial features contained in a video sequence is paramount to
intelligibility, such as in the case of hearing-impaired viewers who may rely on lip reading. For such an application, decoded
video image sequences which contain distorted facial regions can be annoying to a viewer, since such image sequences
are often depicted with overly smoothed-out facial features, giving the faces an artificial quality. For example, fine facial
features such as wrinkles that are present on faces found in an original video image tend to be erased in a decoded
version of a compressed and transmitted video image, thus hampering the viewing of the video image.

Several techniques for reducing distortions in skin areas of images that are transmitted have focused on extracting
qualitative information about the content of the video images including faces, hands and the other skin areas of the
persons in the scene, in order to code such identified areas using fewer data compression components. Thus, these
identified areas are coded and transmitted using a larger number of bits per second, so that such areas contain fewer
distorted features when the video images are decoded.

In one technique, a sequence of video images is searched for symmetric shapes. A symmetric shape is defined
as a shape which is divisible into identical halves about an axis of symmetry. An axis of symmetry is a line segment
which divides an object into equal parts. Examples of symmetrical shapes include squares, circles and ellipses. If the
objects in a video image are searched for symmetrical shapes, some of the faces and heads shown in the video image
are identifiable. Faces and heads that are depicted symmetrically typically approximate the shape of an ellipse and
have an axis of symmetry vertically positioned between the eyes, through the center of the nose and halfway across
the mouth. Each half-ellipse is symmetric because each contains one eye, half of the nose and half of the mouth.
However, only those faces and heads that are symmetrically depicted in the video image are recognizable, precluding
the identification of heads and faces when viewed in profile (turned to the left or turned to the right), since a face or
head viewed in profile does not contain an axis of symmetry. Hands and other skin areas of the persons in the scene
are similarly not symmetric objects and are also not recognizable using a symmetry based technique.

Another technique searches the video images for specific geometric shapes such as, for example, ellipses,
rectangles or triangles. Searching the video images for specific geometric shapes can often locate heads and faces, but
still cannot identify hands and other skin areas of persons in the scene, since such areas are typically not represented
by a specified geometric shape. Additionally, partially obstructed faces and heads which do not approximate a specified
geometric shape are similarly not recognizable.

In yet another technique, a sequence of video images is searched using color (hue) to identify skin areas including
heads, faces and hands. Color (hue) based identification is dependent upon using a set of specified skin tones to
search the video sequences for objects which have matching skin colors. While the color (hue) based techniques are
useful to identify some hands, faces or other skin areas of a scene, many other such areas cannot be identified since
not all persons have the same skin tone. In addition, color variations in many skin areas of the video sequences will
also not be detectable. This is because the use of a set of specified skin tones to search for matching skin areas
precludes color based techniques from compensating for unpredictable changes to the color of an object, such as
variations attributable to background lighting and/or shading.

Accordingly, skin identification techniques that identify hands, faces and other skin areas of persons in a scene
continue to be sought.

Summary of the Invention

The present invention is directed to a skin area detector for identifying skin areas in video images and, in an
illustrative application, is used in conjunction with the video coder of video encoding/decoding (Codec) equipment. The
skin area detector identifies skin areas in video frames by initially analyzing the shape of all the objects in a video
sequence to locate one or more objects that are likely to contain skin areas. Objects that are likely to contain skin areas
are further analyzed to determine if the picture elements (pixels) of any such object or objects have signal energies
characteristic of skin regions. The term signal energy as used herein refers to the sum of the squares of the luminance
(brightness) parameter for a specified group of pixels in the video signal. The signal energy includes two components,
a direct current (DC) signal energy and an alternating current (AC) signal energy. The color parameters of objects with
picture elements (pixels) that have signal energies characteristic of skin regions are then sampled to determine a range
of skin tone values for the object. This range of sampled skin tone values for the analyzed object is then compared
with all the tones contained in the video image, so as to identify other areas in the video sequence having the same
skin tone values. The identification of likely skin regions in objects based on shape analysis and a determination of the
signal energies characteristic of skin regions is advantageous. This is because the subsequent color sampling of such
identified objects to determine a range of skin tone values automatically compensates for color variations in the object,
and thus skin detection is made dynamic with respect to the content of a video sequence.

In the present illustrative example, the skin area detector is integrated with but functions independently of the other
component parts of the video encoding/decoding (Codec) equipment, which includes an encoder, a decoder and a
coding controller. In one embodiment, the skin area detector is inserted between the input video signal and the coding
controller, to provide input related to the location of skin areas in video sequences, prior to the encoding of the video
images.

In one example of the present invention, the skin area detector includes a shape locator and a tone detector. The
shape locator analyzes input video sequences to identify the edges of all the objects in a video frame and determine
whether such edges approximate the outline of a shape that is likely to contain a skin area. The shape locator is
advantageously programmed to identify certain shapes that are likely to contain skin areas. For example, since human
faces have a shape that is approximately elliptical, the shape locator is programmed to search for elliptically shaped
objects in the video signal.

Since an entire video frame is too large to analyze globally, it is advantageous if the video frame of an input video
sequence is first partitioned into image areas. For each image area, the edges of objects are then determined based
on changes in the magnitude of the pixel (picture element) intensities for adjacent pixels. If the changes in the magnitude
of the pixel intensities for adjacent pixels in each image area are larger than a specified magnitude, the location of
such an image area is identified as containing an edge or a portion of the edge of an object.
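The partition-then-compare step described above can be illustrated with a minimal sketch. It assumes a frame stored as a list of rows of pixel intensities; the function names, the 2 x 2 area size and the threshold of 20 are hypothetical choices for the example, not values from the specification.

```python
def partition(frame, area_h, area_w):
    """Split a frame into non-overlapping area_h x area_w image areas,
    keyed by the (row, col) of each area's top-left pixel."""
    areas = {}
    for top in range(0, len(frame), area_h):
        for left in range(0, len(frame[0]), area_w):
            areas[(top, left)] = [row[left:left + area_w]
                                  for row in frame[top:top + area_h]]
    return areas


def area_contains_edge(area, min_change):
    """True if any two horizontally or vertically adjacent pixels in the
    area differ in intensity by more than min_change."""
    rows, cols = len(area), len(area[0])
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and abs(area[r][c + 1] - area[r][c]) > min_change:
                return True
            if r + 1 < rows and abs(area[r + 1][c] - area[r][c]) > min_change:
                return True
    return False


# A flat background with a bright object in the lower-right corner.
frame = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 90],
    [10, 10, 90, 90],
]
edges = {pos: area_contains_edge(area, min_change=20)
         for pos, area in partition(frame, 2, 2).items()}
print(edges)  # only the area at (2, 2) is flagged
```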

Thereafter, identified edges or a portion of identified edges are further analyzed to determine if such edges, which
represent the outline of an object, approximate a shape that is likely to contain a skin area. Since skin areas are usually
defined by the softer curves of human shapes (e.g., the nape of the neck, and the curve of the chin), rigid angular
borders are not typically indicative of skin areas. Thus, configurations that are associated with softer human shapes
are usually selected as likely to contain skin areas. For example, since an ellipse approximates the shape of a person's
face or head, the analysis of a video sequence to identify those outlines of objects which approximate ellipses
advantageously determines some locations in the video sequence that are likely to contain skin areas. Also, in the context
of video conferencing, at least one person is typically facing the camera, so if one or more persons are in the room,
then it is likely that an elliptical shape will be identified.

Once objects likely to contain skin areas are located by the shape locator, the tone detector examines the picture
elements (pixels) of each located object to determine if such pixels have signal energies that are characteristic of skin
areas, then samples the range of skin tones for such identified objects and compares the range of sampled skin tones
with the tones in the entire frame to determine all matching skin tones. In the present embodiment, the signal energy
components (DC and AC energy components) of the luminance parameter are advantageously determined using the
discrete cosine transform (DCT) technique.

The discrete cosine transform (DCT) of the signal energy for a specified
group of pixels in an object identified as likely to contain a skin area is calculated. Thereafter, the AC energy component
of each pixel is determined by subtracting the DC energy component for each pixel from the discrete cosine transform
(DCT). Based on the value of the AC energy component for each pixel, a determination is made as to whether such
pixels have an AC signal energy characteristic of a skin area. If the AC signal energy for an examined pixel is less than
a specified value, such pixels are identified as skin pixels. Thereafter, the tone detector samples the color
parameters of such identified pixels and determines a range of color parameters indicative of skin tone that are
contained within the region of the object.

The color parameters sampled by the tone detector are advantageously chrominance parameters, Cr and Cb. The
term chrominance parameters as used herein refers to the color difference values of the video signal, wherein Cr is
defined as the difference between the red color component and the luminance parameter (Y) of the video signal and
Cb is defined as the difference between the blue color component and the luminance (Y) parameter of the video signal.
The tone detector subsequently compares the range of identified skin tone values from the sampled object with the
color parameters of the rest of the video frame to identify other skin areas.

The skin area detector of the present invention thereafter analyzes the next frame of the video sequence to
determine the range of skin tone values and identify skin areas in the next video frame. The skin area detector optionally
uses the range of skin tone values identified in one frame of a video sequence to identify skin areas in subsequent
frames.

The skin area detector optionally includes an eyes-nose-mouth (ENM) region detector for analyzing some objects
which approximate the shape of a person's face or head, to determine the location of an eyes-nose-mouth (ENM)
region. In one embodiment, the ENM region detector is inserted between the shape locator and the tone detector to
determine the location of an ENM region and use such a region as a basis for analysis by the tone detector. The
eyes-nose-mouth (ENM) region detector utilizes symmetry based methods to identify an ENM region located within an object
which approximates the shape of a person's face or head. It is advantageous for the eyes-nose-mouth (ENM) region
to be identified since such a region of the face contains skin color parameters as well as color parameters other than
skin tone parameters, including for example, eye color parameters, eyebrow color parameters, lip color parameters
and hair color parameters. Also, the identification of the eyes-nose-mouth (ENM) region reduces computational
complexity, since skin tone parameters are sampled from a small region of the identified object.

Other objects and features of the present invention will become apparent from the following detailed description
considered in conjunction with the accompanying drawings. It is to be understood, however, that the drawings are
designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference
should be made to the appended claims.

Brief Description of the Drawings 

FIG. 1 is a block diagram of a video coder/decoder (Codec) embodying an illustrative application of the principles
of the present invention; 

FIG. 2 is a block diagram of the skin area detector of the present invention; 
FIG. 3 shows a block diagram of the shape locator of FIG. 2; 
FIG. 4 is a block diagram of the preprocessor circuit of the shape locator of FIG. 3; 
FIG. 5 shows a block diagram of the tone detector of FIG. 2;

FIG. 7 is a block diagram of the skin area detector including an eyes-nose-mouth (ENM) region detector; and

FIG. 8 illustrates a rectangular window located within an ellipse. 

Detailed Description

FIG. 1 shows an illustrative application of the present invention wherein a skin area detector 12 is used in
conjunction with a video coding/decoding system such as, for example, video codec 10 (coder/decoder). Video
coding/decoding systems such as video codec 10 are used for the coding and decoding of video image sequences based on
image compression techniques. An example of an image compression technique useful for the coding and decoding of
video image sequences is the Discrete Cosine Transform (DCT) method, described in ITU-T Recommendation H.263.

One embodiment of the skin area detector 12 is illustrated in FIG. 1 (shown within dashed lines), located within
video codec 10. The skin area detector 12 is integrated with but functions independently from the other component
parts of video codec 10. Video codec 10 includes additional component parts, including an encoder, a decoder and a
coding controller. Such component parts will be discussed in conjunction with the operation of the video codec.

Skin area detector 12, shown in the block diagram of FIG. 2, includes a shape locator 50 and a tone detector 56.
The functions represented by the shape locator 50 and the tone detector 56 are optionally provided by a single shared
processor or by a plurality of individual processors.

Also, the use of the individual functional blocks representing shape locator 50 and tone detector 56 is not to be
construed to refer exclusively to hardware capable of executing software. Examples of additional illustrative
embodiments for the functional blocks described above include digital signal processor (DSP) hardware, such as the AT&T
DSP16 or DSP32C, read-only memory (ROM) for storing software performing the operations discussed below, and
random access memory (RAM) for storing digital signal processor (DSP) results. Very large scale integration (VLSI)
hardware embodiments, as well as custom VLSI circuitry in combination with a general purpose digital signal processor
(DSP) circuit are also optionally contemplated. Any and/or all such embodiments are deemed to fall within the meaning
of the functional blocks labeled shape locator 50 and tone detector 56.

The present invention identifies skin areas in video image sequences. The shape locator 50 initially locates one or
more likely skin areas in a video frame based on the identification of the edges of all objects in the video frame and a
determination of whether any such edges approximate the outline of a predetermined shape. The analysis of edges
to identify outlines that approximate a predetermined shape, such as an ellipse, determines the likely
location of some skin areas.

Objects identified as likely skin areas are thereafter analyzed by tone detector 56 to determine whether the picture
elements (pixels) of any such object or objects have signal energies characteristic of skin regions. The term signal
energy as used in this disclosure refers to the sum of the squares of the luminance (brightness) parameter for a specified
group of pixels in the video signal and includes two components: a direct current (DC) signal energy and an
alternating current (AC) signal energy. The color parameters of objects with picture elements (pixels) that have signal
energies characteristic of skin regions are then sampled to determine a range of skin tone (color) values for the object.
The range of skin tone values for the object is then compared with all the tones contained in the video image so as
to identify other areas in the video sequence having the same skin tone values. When skin areas are identified based
on an analysis of signal energies, followed by the sampling of a range of skin tone values, skin detection is made
dynamic with respect to the content of a video sequence, because the skin tone sampling of identified objects
automatically compensates for color variations in the object.

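The component parts of both shape locator 50 and tone detector 56 are described below, with reference to FIG. 2,
as part of an explanation of the operation of skin area detector 12. An input video signal 26, representing a sequence
of frames corresponding to an image of an object, is provided to the shape locator 50 from a conventional video camera
(not shown) such as, for example, one manufactured by Sharp Corporation. Shape locator 50 analyzes at least one of
the frames of the input video signal 26 to identify the edges of all the objects in the frame and determine if an edge or
a portion of an edge approximates a shape that is likely to include a skin area, such as an ellipse or a curve.

The component parts of shape locator 50 are illustrated in FIG. 3 and include a shape location preprocessor 94,
a coarse scanner 100, a fine scanner 102 and a shape fitter 104. The shape fitter 104 generates a shape location
signal 106, which is provided to the tone detector 56.

The shape location preprocessor 94 identifies the edges of objects contained in the video frame. It incorporates
the preprocessor circuit illustrated in FIG. 4, which includes a downsampler, a filter, a decimator, an edge detector
and a thresholding circuit 126.

The downsampler limits the number of frames of the input video signal 26 that are available for edge detection by
selecting, for analysis, only a small number of frames from a video signal such as input video signal 26. As a result,
the downsampler reduces the frame rate presented to the edge detector.

The filter performs spatial filtering of a video frame having a size of 360 x 240 pixels, with a cut-off frequency
related to the decimation factor, discussed below. Typically, a filter such as this defines a range of frequencies. When
a signal such as the downsampled input video signal is provided to the filter, only those frequencies contained in the
video signal that are within the range of defined frequencies are output.

The filtered video signal is then partitioned into image areas of smaller predetermined size for edge analysis; in
the present example, the video frame is partitioned into image areas of 45 x 30 pixels.

The edge detector performs edge detection on each of the partitioned image areas of the video frame. The edge
of an object is typically characterized by changes in the magnitude of the pixel intensities for adjacent pixels. If an
image area does not contain an edge of an object, the magnitudes of the pixel intensities for adjacent pixels,
representative of such an image area, are nearly equivalent, as shown in matrix A.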

If the two-dimensional Sobel operators, a horizontal operator \delta_x and a vertical operator \delta_y,

\[ \delta_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \qquad \delta_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix} \]

are convolved with the magnitudes of the pixel intensities for adjacent pixels in an image area that does not contain
the edge of an object such as, for example, the pixel intensities for adjacent pixels of matrix A,

\[ A = \begin{pmatrix} 11 & 10 & 10 \\ 10 & 10 & 10 \\ 10 & 10 & 11 \end{pmatrix} \]

the resulting convolution produces, in part, the result shown below,

\[ \delta_x = (-1 \times 11) + (0 \times 10) + (1 \times 10) + (-2 \times 10) + (0 \times 10) + (2 \times 10) + (-1 \times 10) + (0 \times 10) + (1 \times 11) = 0 \]

\[ \delta_y = (1 \times 11) + (2 \times 10) + (1 \times 10) + (0 \times 10) + (0 \times 10) + (0 \times 10) + (-1 \times 10) + (-2 \times 10) + (-1 \times 11) = 0 \]

whose magnitudes approximate zero in two dimensions. In contrast, if the Sobel operators are convolved with the
magnitudes of pixel intensities for adjacent pixels in an image area that contains the edge of an object such as, for
example, the magnitudes of the pixel intensities for adjacent pixels shown in matrix B,

\[ B = \begin{pmatrix} 10 & 50 & 90 \\ 50 & 50 & 90 \\ 90 & 90 & 90 \end{pmatrix} \]

the resulting convolution produces, in part, the result shown below,

\[ \delta_x = (-1 \times 10) + (0 \times 50) + (1 \times 90) + (-2 \times 50) + (0 \times 50) + (2 \times 90) + (-1 \times 90) + (0 \times 90) + (1 \times 90) = 160 \]

\[ \delta_y = (1 \times 10) + (2 \times 50) + (1 \times 90) + (0 \times 50) + (0 \times 50) + (0 \times 90) + (-1 \times 90) + (-2 \times 90) + (-1 \times 90) = -160 \]

whose magnitudes do not approximate zero. Edge detection techniques utilizing, for example, the above described
Sobel operator techniques, are performed for each of the partitioned 45 x 30 pixel areas of the video frame.
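The worked products above can be reproduced with a few lines of code. In this sketch the operator coefficients and the entries of matrices A and B are read off the printed products term by term; the helper name `apply_operator` is illustrative, and the "convolution" is evaluated, as in the text, as an element-by-element multiply-and-sum at a single 3 x 3 position.

```python
# Sobel operators and 3 x 3 image areas, read off the worked products in the text.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

MATRIX_A = [[11, 10, 10], [10, 10, 10], [10, 10, 11]]   # no edge
MATRIX_B = [[10, 50, 90], [50, 50, 90], [90, 90, 90]]   # contains an edge


def apply_operator(operator, area):
    """Multiply a 3 x 3 operator with a 3 x 3 image area element by element
    and sum the nine products."""
    return sum(operator[r][c] * area[r][c] for r in range(3) for c in range(3))


for name, area in (("A", MATRIX_A), ("B", MATRIX_B)):
    dx = apply_operator(SOBEL_X, area)
    dy = apply_operator(SOBEL_Y, area)
    print(name, dx, dy, dx * dx + dy * dy)
# A 0 0 0
# B 160 -160 51200
```

The last column is the sum of the squared responses, which is the quantity later compared against the specified value of the thresholding circuit.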

Thresholding circuit 126 then identifies those pixels in each 45 x 30 partitioned area whose magnitude of
convolved, squared and summed pixel intensities for adjacent pixels is larger than a specified value, assigning such
identified pixels a non-zero numerical value. Pixels having a magnitude of convolved, squared and summed pixel
intensities for adjacent pixels less than the specified value of the thresholding circuit 126 are assigned a zero numerical
value. Edge data signals 128 corresponding to the non-zero pixel values are subsequently generated by the
thresholding circuit 126. The incorporation of a thresholding circuit, such as thresholding circuit 126, advantageously prevents
contoured skin areas that are not edges from being misidentified as edges. This is because small variations in the
magnitudes of the pixel intensities for adjacent pixels typically produce convolved, squared and summed magnitudes
that are less than the specified value of the thresholding circuit 126.
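A minimal sketch of the thresholding step follows. It assumes each pixel already carries its pair of horizontal and vertical operator responses; the function names and the specified value of 1000 are hypothetical, chosen only to show that small intensity variations are mapped to zero while a strong edge response survives.

```python
def edge_strength(dx, dy):
    """Sum of the squared horizontal and vertical operator responses."""
    return dx * dx + dy * dy


def threshold_edges(responses, specified_value):
    """Assign 1 (non-zero) to pixels whose edge strength exceeds the
    specified value and 0 to all other pixels."""
    return [[1 if edge_strength(dx, dy) > specified_value else 0
             for dx, dy in row]
            for row in responses]


# (dx, dy) pairs per pixel: one strong edge, one gentle contour, two flat pixels.
responses = [
    [(0, 0), (160, -160)],
    [(3, -2), (0, 0)],
]
edge_map = threshold_edges(responses, specified_value=1000)
print(edge_map)  # [[0, 1], [0, 0]]
```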

Referring again to FIG. 3, the edge data signals 128 generated by the shape location preprocessor 94 are input
to the coarse scanner 100 of shape locator 50. The coarse scanner 100 segments the edge data signals 128 provided
by the shape location preprocessor 94 into blocks of size B x B pixels; for example, of size 5 x 5 pixels. Each block is
then marked by the coarse scanner 100 if at least one of the pixels in the block has a non-zero value, as discussed
above. The array of segmented B x B blocks is then scanned in, for example, a left-to-right, top-to-bottom fashion,
searching for contiguous runs of marked blocks. For each such run of marked blocks, fine scanning and shape fitting
are performed. The inclusion of coarse scanner 100 as a component part of shape locator 50 is optional, depending
on the computational complexity of the system utilized. The fine scanner 102 scans the pixels in each contiguous run
of segmented and marked B x B blocks, for example, in a left-to-right, top-to-bottom fashion, to detect the first pixel in
each line of pixels that has a non-zero value and the last pixel in each line of pixels that has a non-zero value. The
first and last non-zero detected pixels of each line are labeled with coordinates (x_start, y) and (x_end, y), respectively.
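The coarse and fine scans can be sketched as follows, assuming the edge data is a binary map stored as a list of rows. For brevity, the fine scan here runs over the whole map rather than only over contiguous runs of marked blocks; the function names and the 3 x 3 block size are illustrative assumptions.

```python
def mark_blocks(edge_map, b):
    """Coarse scan: return the set of (block_row, block_col) indices of the
    B x B blocks that contain at least one non-zero pixel."""
    marked = set()
    for y, row in enumerate(edge_map):
        for x, value in enumerate(row):
            if value != 0:
                marked.add((y // b, x // b))
    return marked


def fine_scan(edge_map):
    """Fine scan: for each line y holding a non-zero pixel, record the
    first and last non-zero columns as (x_start, x_end)."""
    spans = {}
    for y, row in enumerate(edge_map):
        xs = [x for x, value in enumerate(row) if value != 0]
        if xs:
            spans[y] = (xs[0], xs[-1])
    return spans


# Binary edge map holding the outline of a small rounded object.
edge_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(sorted(mark_blocks(edge_map, 3)))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(fine_scan(edge_map))               # {1: (2, 3), 2: (1, 4), 3: (1, 4), 4: (2, 3)}
```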

The shape fitter 104 scans the coordinates labeled (x_start, y) and (x_end, y) for each line of pixels. Geometric shapes
of various sizes and aspect ratios stored in the memory of the shape fitter 104 that are likely to contain skin areas are
then compared to the labeled coordinate areas, in order to determine approximate shape matches. Having determined
a shape outline from a well fitting match of a predetermined shape that is likely to contain a skin area such as, for
example, an ellipse, the shape locator 50 generates a shape location signal 106 based on the coordinates of the
well-fitted shape, and provides such a shape location signal 106 to the tone detector 56.
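One way to compare stored shapes against the labeled coordinates is sketched below for ellipses. The specification does not state how a match is scored, so the mean absolute endpoint error used here, the candidate list, and all function names are assumptions made purely for illustration.

```python
import math


def ellipse_span(cx, cy, a, b, y):
    """Left and right x-coordinates of an axis-aligned ellipse (center cx, cy;
    semi-axes a, b) on line y, or None if the line misses the ellipse."""
    t = 1.0 - ((y - cy) / b) ** 2
    if t < 0:
        return None
    half = a * math.sqrt(t)
    return cx - half, cx + half


def fit_error(spans, cx, cy, a, b):
    """Mean absolute distance between labeled (x_start, x_end) pairs and the
    ellipse outline; infinite if any labeled line lies outside the ellipse."""
    total = 0.0
    for y, (x_start, x_end) in spans.items():
        expected = ellipse_span(cx, cy, a, b, y)
        if expected is None:
            return math.inf
        total += abs(x_start - expected[0]) + abs(x_end - expected[1])
    return total / (2 * len(spans))


def best_fit(spans, candidates):
    """Return the stored candidate (cx, cy, a, b) with the smallest error."""
    return min(candidates, key=lambda e: fit_error(spans, *e))


# Labeled coordinates generated from an ellipse with a = 4, b = 6, rounded to pixels.
spans = {}
for y in range(5, 16):
    left, right = ellipse_span(10, 10, 4, 6, y)
    spans[y] = (round(left), round(right))

candidates = [(10, 10, 2, 6), (10, 10, 4, 6), (10, 10, 6, 6)]  # stored aspect ratios
print(best_fit(spans, candidates))  # (10, 10, 4, 6)
```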

Once shape locator 50 has identified the location of an object with a border that indicates the object is likely to 
contain a skin area, tone detector 56 functions to analyze whether such an object contains signal energies that are 
characteristic of skin regions. If the object contains signal energies that are characteristic of skin regions the tone 
detector 56 samples the color parameters of the object, in order to identify a range of skin tone values. The tone detector 
56 then compares the identified range of skin tone values to the color parameters of the rest of the video frame to 
identify other areas containing the same skin tone values. 
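The final comparison step can be illustrated with a short sketch. It assumes chrominance is available as (Cr, Cb) pairs and, for simplicity, takes the range of skin tone values as the minimum and maximum of the sampled values; the statistical derivation of the range is described later in the text, and the names below are illustrative.

```python
def skin_tone_range(samples):
    """Smallest (min, max) ranges that cover the sampled (Cr, Cb) pairs."""
    crs = [cr for cr, _ in samples]
    cbs = [cb for _, cb in samples]
    return (min(crs), max(crs)), (min(cbs), max(cbs))


def match_skin(frame_chroma, cr_range, cb_range):
    """Mark every position of the frame whose (Cr, Cb) pair lies inside
    both ranges as a matching skin tone."""
    return [[cr_range[0] <= cr <= cr_range[1] and cb_range[0] <= cb <= cb_range[1]
             for cr, cb in row]
            for row in frame_chroma]


# Chrominance sampled from an object already identified as a skin region.
samples = [(20, -12), (24, -10), (22, -15)]
cr_range, cb_range = skin_tone_range(samples)

# Chrominance of the rest of the frame, one (Cr, Cb) pair per position.
frame_chroma = [
    [(21, -11), (40, -11)],
    [(22, -30), (24, -15)],
]
print(match_skin(frame_chroma, cr_range, cb_range))
# [[True, False], [False, True]]
```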

Color digital video signals contain red (R), green (G) and blue (B) color components and are typically available in
a standard YUV color video format, where Y represents the luminance parameter and both U and V represent the
chrominance parameters. The luminance (Y) parameter characterizes the brightness of the video image, while the
chrominance (U,V) parameters define two color difference values, Cr and Cb. The relationships between the luminance
(Y) parameter, the color difference values, Cr and Cb, and the three color components R, G and B are typically expressed
as:

Y = 0.299R + 0.587G + 0.114B
Cr = R - Y
Cb = B - Y
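The relationships above translate directly into code. The sketch below uses the luminance weights given in the text and the definitions of the color difference values as red minus luminance and blue minus luminance; the function name and the 0 to 255 component range are assumptions for the example.

```python
def rgb_to_y_cr_cb(r, g, b):
    """Luminance and color difference values from R, G, B components."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = r - y   # red color difference
    cb = b - y   # blue color difference
    return y, cr, cb


# A neutral (white) pixel has no color difference; a pure red pixel has a
# large positive Cr and a negative Cb.
white = rgb_to_y_cr_cb(255, 255, 255)
red = rgb_to_y_cr_cb(255, 0, 0)
print([round(v, 3) for v in white])
print([round(v, 3) for v in red])
```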

In one embodiment of the present invention, tone detector 56, as shown in FIG. 5, includes a skin region detector
200, a Cr histogram generator 201, a Cb histogram generator 203, a Cr range detector 205, a Cb range detector 207
and a tone comparator 209.

Skin region detector 200 correlates the input video signal 26 with the shape location signal 106, so that the objects
identified in the video frame by the shape locator 50 are segmented into blocks of D x D pixels. Skin region detector
200 advantageously segments the identified shape into blocks of 2 x 2 pixels, where D = 2, in order to obtain one
luminance parameter for each pixel as well as one Cr value and one Cb value for every block of 2 x 2 pixels. As an
illustrative example, FIG. 6 shows a 4 x 4 block of pixels 300. A luminance parameter (Y) 301 is present for each pixel
300. In contrast, each block of 2 x 2 pixels 300 has one Cr value 302 and one Cb value 303, which is present at the
1/2 dimension in both the horizontal and vertical directions. Thus, each block of 2 x 2 pixels includes four luminance
(Y) parameters 301, as well as one Cr value 302 and one Cb value 303. Such segmentation, to include only one
Cr value and only one Cb value, is important when skin tone sampling is performed for an identified object, as discussed
below.

Skin region detector 200 functions to analyze which of the blocks of D x D pixels lying within the perimeter of an
identified object represent skin areas by determining whether each D x D block of pixels has signal energies
characteristic of a skin region. The luminance (Y) parameter of the color video signal has two signal energy components:
an alternating current (AC) energy component and a direct current (DC) energy component. Skin area pixels typically
have AC energy components with values less than a specified threshold energy, T_en.

In an embodiment of the present invention, skin areas are detected based on the calculation of the AC energy
components for the luminance (Y) parameter of the color video signal. Methods including the discrete cosine
transformation (DCT) technique, as described in ITU-T Recommendation H.263 ("Video coding for narrow communication
channels"), are useful for calculating the signal energies of the luminance (Y) parameter. As an illustrative example,
the AC energy components and the DC energy components of the luminance parameters for each block of D x D pixels
are determined by first calculating the discrete cosine transformation (DCT) function, F(u, v), for each pixel as shown
below, from equation (1)

\[ F(u,v) = C(u)\,C(v) \sum_{i=0}^{D-1} \sum_{j=0}^{D-1} f(i,j)\, \cos\frac{(2i+1)u\pi}{2D}\, \cos\frac{(2j+1)v\pi}{2D} \qquad (1) \]

where F(u, v) represents the discrete cosine transformation (DCT) function and C(u) and
C(v) are defined as

\[ C(\omega) = 1/\sqrt{2} \quad \text{for } \omega = 0 \]

\[ C(\omega) = 1 \quad \text{for } \omega = 1, 2, 3, \ldots \]

which are summed for each pixel location F(u,v) of the block of D x D pixels. The AC signal energy, E(m, l), is then
determined by subtracting the square of the direct current (DC) signal energy, F_{m,l}(0,0), from the square of the discrete
cosine transformation function F(u, v), as shown in equation (2)

\[ E(m,l) = \sum_{u=0}^{D-1} \sum_{v=0}^{D-1} F_{m,l}(u,v)^2 - F_{m,l}(0,0)^2 \qquad (2) \]

The AC signal energy, E(m, l), is then compared to a threshold energy, T_en. For each D x D block of pixels, if the
AC signal energy, E(m, l), is less than a preselected threshold energy T_en, the block of pixels is identified as a skin
area, as indicated below,

\[ E(m,l) < T_{en} \quad \text{Skin area} \]

\[ E(m,l) > T_{en} \quad \text{Non-skin area} \]

Typically, when a D x D block of pixels has an AC signal energy value that is less than 120,000, such a block of
pixels is identified as a skin region. It is advantageous to utilize the signal energy components of the luminance
parameter to determine skin areas, since non-skin areas tend to have much higher signal energy components than do
skin areas. Identifying such non-skin areas and eliminating them from the color sampling process increases the
probability that a sampled pixel is actually a skin area pixel, and thus improves the accuracy of the range of
tones to be sampled.
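
The block test described by equations (1) and (2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, NumPy is assumed, and no normalization constant is applied in front of the double sum because none is legible in the text, so the default threshold of 120,000 is only meaningful under whatever scaling the original equation actually used.

```python
import numpy as np

def dct2(block):
    """2-D DCT of a square D x D block, in the form of equation (1)."""
    d = block.shape[0]
    idx = np.arange(d)
    # basis[u, i] = cos((2i + 1) * u * pi / (2D))
    basis = np.cos((2 * idx[None, :] + 1) * idx[:, None] * np.pi / (2 * d))
    c = np.ones(d)
    c[0] = 1 / np.sqrt(2)  # C(0) = 1/sqrt(2), C(w) = 1 otherwise
    scaled = c[:, None] * basis
    # Assumption: no extra scale factor in front of the sums.
    return scaled @ block @ scaled.T

def ac_energy(block):
    """AC energy per equation (2): sum of F(u, v)^2 minus the DC term squared."""
    f = dct2(np.asarray(block, dtype=float))
    return float(np.sum(f ** 2) - f[0, 0] ** 2)

def is_skin_block(block, t_en=120_000.0):
    """Classify a luminance block as skin when its AC energy is below T_en."""
    return ac_energy(block) < t_en
```

A flat luminance block has essentially zero AC energy and passes the test, whereas a strongly textured block such as an 8 x 8 checkerboard of 0 and 255 values does not.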

Once a block of D x D pixels has been identified by the skin region detector 200 as a skin region, the C_r values
and the C_b values of the block of D x D pixels are sampled by the C_r histogram generator 201 and the C_b histogram
generator 203, respectively. As previously discussed, it is advantageous if the blocks of D x D pixels are 2 x 2 blocks
of pixels, since such blocks contain only one C_r value and one C_b value. Both the C_r histogram generator 201 and the
C_b histogram generator 203 then generate histograms for the sampled C_r and C_b values, respectively.

Once a C_r histogram and a C_b histogram have been generated, the range of color parameters representative of
skin tone for the sampled object is determined by the C_r range detector 205 and the C_b range detector 207 using
statistical analysis techniques. For example, with each data set the mean and mode C_r and C_b values are determined
for each block of D x D pixels sampled. When the mean and mode C_r and C_b values are within some specified distance,
D_p, of each other, such mean and mode C_r and C_b values are identified as representing a single peak. Thereafter, for
each block of D x D pixels, if a pixel color parameter is within a predetermined distance, for example, one standard
deviation, of such mean and mode C_r and C_b values representative of a single peak, then the pixel color parameter is
included in the range of skin tone values. When the mean and mode are separated by a distance greater than the specified
distance, D_p, such mean and mode C_r and C_b values are identified as representing two individual peaks. The pixel
color parameters for blocks of D x D pixels with mean and mode C_r and C_b values that are representative of two
individual peaks are not included in the range of skin tone values.
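
The single-peak test can be illustrated for one chrominance channel with the sketch below. The function name, the default value of the distance D_p, and the choice to extend the range one standard deviation beyond both the mean and the mode are assumptions made for illustration; the description only requires that accepted values lie within a predetermined distance of the mean and mode.

```python
import numpy as np

def skin_tone_range(samples, d_p=4.0):
    """Return (low, high) for one chrominance channel, or None for two peaks.

    samples: non-negative integer chrominance values (C_r or C_b).
    d_p: hypothetical maximum mean-to-mode distance for a single peak.
    """
    values = np.asarray(samples, dtype=int)
    mean = values.mean()
    mode = int(np.bincount(values).argmax())  # most frequent histogram bin
    if abs(mean - mode) > d_p:
        return None  # mean and mode too far apart: treated as two peaks
    spread = values.std()  # one standard deviation, as in the example above
    return (min(mean, mode) - spread, max(mean, mode) + spread)
```

Running the function once on the sampled C_r values and once on the sampled C_b values yields the two ranges that are passed on for comparison against the rest of the frame.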

Based on the range of C_r and C_b values generated in the C_r range detector 205 and the C_b range detector 207,
respectively, the tone comparator 209 analyzes the entire frame of the input video signal 26 to locate all other areas
containing the same chrominance parameters. When such other regions are located, a skin information signal 211
denoting the location of the skin areas is generated by the tone comparator 209.
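
As a hedged sketch of the comparison step, the frame's chrominance planes can be tested against the two ranges to produce a boolean map of matching locations. The function name and the representation of the skin information signal as a boolean array are assumptions, not details given in the description.

```python
import numpy as np

def skin_mask(cr, cb, cr_range, cb_range):
    """Mark every location whose C_r and C_b both fall inside the skin ranges."""
    cr = np.asarray(cr)
    cb = np.asarray(cb)
    return ((cr >= cr_range[0]) & (cr <= cr_range[1]) &
            (cb >= cb_range[0]) & (cb <= cb_range[1]))
```

Each `True` entry marks a location whose chrominance matches the sampled skin tones; a location is rejected if either channel falls outside its range.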

Skin area detector 12 performs the above described analysis for each frame of a video sequence or optionally 
analyzes a single frame and then the tone comparator 209 utilizes that range of skin tone values to identify skin areas 
in a specified number of subsequent frames. 

In the embodiment of the present invention wherein the outline of an object or objects identified by the shape
locator 50 matches a well-fitting ellipse, and before such a shape or shapes have been verified to contain skin areas, a
shape location signal 106 generated by the shape locator 50 is optionally provided to an eyes-nose-mouth (ENM)
region detector 52, as shown in FIG. 7. The ENM region detector 52 receives the coordinates of the well-fitted elliptical
outlines from the shape locator 50 and segments the elliptical region into a rectangular window 60 and a complement
area 62 (containing the remainder of the ellipse not located within rectangular window 60), as shown in FIG. 8. The
ENM region detector 52 receives the elliptical parameters and processes them such that a rectangular window 60 is
positioned to capture the region of the ellipse corresponding to the eyes, nose and mouth region.

The ENM region detector 52 determines a search region for locating rectangular window 60 using the search
region identifier 108, where the coordinates of the center point (x_0, y_0) of the elliptical outline as shown in FIG. 8 are
used to obtain estimates for the positioning of the rectangular window 60. The search region for locating the center
point of the ENM region is a rectangle of size S x T pixels such as, for example, 12 x 15 pixels, and is advantageously
chosen to have a fixed size relative to the major and minor axes of the elliptical shape outline. The term major axis as
used in this disclosure is defined with reference to FIG. 8 and refers to the line segment bisecting the ellipse between
points y_1 and y_2. The term minor axis as used in this disclosure is also defined with respect to FIG. 8 and refers to the
line segment bisecting the ellipse between points x_1 and x_2. As an illustrative example, assume the ellipse has a length
along the major axis of 50 pixels and a length along the minor axis of 30 pixels. The size of the rectangular window 60
is advantageously chosen to have a size of 25 x 15 pixels, which approximates half the length of the ellipse along both
the major and minor axes and captures the most probable location of the eyes-nose-mouth region of the shape.
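
The window sizing in the example above amounts to scaling each axis length by one half. A minimal sketch, in which the function name and the `scale` parameter are illustrative assumptions:

```python
def enm_window_size(major_len, minor_len, scale=0.5):
    """Return the window size (along major axis, along minor axis) in pixels."""
    return (round(major_len * scale), round(minor_len * scale))
```

For the ellipse in the example, `enm_window_size(50, 30)` gives a window of 25 x 15 pixels.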

Once rectangular window 60 is located within the ellipse, the search region scanner 110 analyzes the rectangular 
window to determine each candidate position for an axis of symmetry with respect to the eyes-nose-mouth region of 
the ellipse. For example, search region scanner 110, in a left-to-right fashion, selects each vertical row of pixels within 
rectangular window 60 using a line segment 64 placed parallel to the major axis, in order to search for an axis of 
symmetry, positioned between the eyes, through the center of the nose and halfway through the mouth. After the axis 
of symmetry is determined with respect to the facial axis, the ENM region detector 52 generates an ENM region signal 
54 corresponding to the coordinates of the resulting eyes-nose-mouth region of the rectangular window 60. The ENM 
signal 54 notifies the tone detector 66 of the coordinates for the location of the eyes, nose, and mouth region of the 
object so that pixels not included in such region are excluded from subsequent color parameter analysis. It is advan- 
tageous for the eyes-nose-mouth region to be identified since such a region of the face contains skin color parameters 
as well as color parameters other than skin tone parameters, including for example, eye color parameters, eyebrow 
color parameters, lip color parameters, and hair color parameters. Identifying the skin color parameters in the eye- 
nose-mouth region improves the accuracy of the range of color parameters that are sampled, since the identification 
of the ENM region is a strong indication of the presence of a skin area. Also, computational complexity is advantageously 
reduced, because the ENM region is smaller than the well-fitted ellipse from which it is derived. 
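
The left-to-right scan for an axis of symmetry can be sketched as below. The description does not state how symmetry is scored, so the measure used here (mean absolute difference between the pixels to the left of a candidate column and the mirrored pixels to its right) is purely an assumption, as are the function name and the restriction to an unrotated window.

```python
import numpy as np

def best_symmetry_column(window):
    """Scan candidate columns left to right; return the most symmetric one."""
    w = np.asarray(window, dtype=float)
    n_cols = w.shape[1]
    best_col, best_score = None, np.inf
    for col in range(1, n_cols - 1):
        half = min(col, n_cols - 1 - col)  # widest span available on both sides
        left = w[:, col - half:col]
        right = w[:, col + 1:col + 1 + half][:, ::-1]  # mirrored right side
        score = np.abs(left - right).mean()  # assumed symmetry measure
        if score < best_score:
            best_col, best_score = col, score
    return best_col
```

Candidate columns near the window edge are compared over a narrower span, so a practical implementation would likely weight or restrict them; that refinement is omitted here.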

Detection of the eyes-nose-mouth region may also be affected when the subject does not look directly at the
camera, which often occurs, for example, in video teleconferencing situations. The ENM region detector 52 also provides
for detection of an eyes-nose-mouth region in an input video image where the subject does not directly face the camera,
has facial hair, and/or wears eyeglasses. The ENM region detector 52 exploits the typical symmetry of facial
features with respect to a longitudinal axis going through the nose and across the mouth, where the axis of symmetry
may be slanted at an angle θ_i, as shown in FIG. 8, with respect to the vertical axis of the image. For such slanted
ellipses, the rectangular window 60 is rotated by discrete angle values about the center of the window, in order to
provide robustness in the detection of the eyes-nose-mouth region. Advantageously, angle θ_i has a value within the
range of -10 degrees to 10 degrees.

Skin area detector 12 is optionally used in conjunction with a video coder/decoder (codec) such as video codec 
10. The following explanation discusses the operation of skin area detector 12 with regard to the other component 
parts of video codec 10 as shown in FIG. 1. Video codec 10 includes video coder 22 and video decoder 24, where
video coder 22 is controlled by coding controller 16. For coding operations the video codec 10 receives an input video
signal 26, which is forwarded to the skin area detector 12 and video coder 22. The skin area detector 12 analyzes the
input video signal as described above and provides information related to the location of skin areas to the coding
controller 16. The video coder 22 codes the input video signal under the control of the coding controller 16 to generate
an output coded bitstream 30, wherein the skin areas, identified using the above described skin area detector, are
encoded with a higher number of bits than are areas that are not so identified. For example, a coding controller, such
as coding controller 16, typically encodes and transmits only those discrete cosine transform (DCT) data components, 
which have a value above some threshold value (quantization factor). As an illustrative example, assume that an area 
of 16 X 16 pixels has data components whose values range from 1 to 16 and that the threshold value was selected to 
be 8. Then, the coding controller will only code those DCT data components whose values are above the threshold 
value of 8. However, in the embodiment of the present invention, the data components having values below the thresh- 
old value, for portions of the video signal that are identified as containing skin areas, now are encoded along with the 
data components having values above the threshold value. As a result, the areas of the video image that are identified 
as skin areas are encoded with a higher number of bits than areas that are not so identified. In one embodiment, the 
video coder 22 encodes the input video signal 26 using a source coder 32, a video multiplex coder 34, a transmission 
buffer 36, and a transmission coder 38 to generate the output coded bitstream 30. 
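
The thresholding example above can be sketched as follows. The function name and the representation of the data components as a flat list of values are assumptions made for illustration only.

```python
def components_to_code(components, threshold, is_skin):
    """Select which DCT data components are coded for one area."""
    if is_skin:
        # Skin areas: components below the threshold are coded as well.
        return list(components)
    # Other areas: only components above the threshold are coded.
    return [c for c in components if c > threshold]
```

With component values 1 through 16 and a threshold of 8, a non-skin area keeps the eight components 9 through 16, while a skin area keeps all sixteen, which is why skin areas end up with more bits.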

For decoding operations, the video codec 10 receives an input coded bitstream 40. The video decoder 24 decodes
the input coded bitstream 40 using a receiving decoder 42, a receiving buffer 44, a video multiplex decoder 46, and a 
source decoder 48 for generating the output video signal 50. 

It should, of course, be understood that while the present invention has been described with reference to an
illustrative embodiment, other arrangements may be apparent to those of ordinary skill in the art.



Claims 

1. An apparatus for determining skin tone in a video signal, the apparatus comprising:

a locator which analyzes at least a portion of the video signal to identify objects of a desired shape; and 
a detector for analyzing at least one pixel from at least one of the identified objects of the desired shape to 
determine whether the analyzed pixel has a luminance parameter indicative of a skin area. 

2. The apparatus of claim 1, wherein the desired shape is a shape that is likely to contain a skin area. 

3. The apparatus of claim 2, wherein the desired shape has an arc associated with a human shape. 

4. The apparatus of claim 3, wherein the desired shape is elliptical. 

5. The apparatus of claim 1, wherein the luminance parameter indicative of the skin area is an alternating current
(AC) signal energy component of the analyzed pixel.

6. The apparatus of claim 1, wherein the detector further samples the analyzed pixel to determine at least one color
parameter of the pixel.

7. The apparatus of claim 6, wherein the at least one color parameter is a chrominance parameter.

8. The apparatus of claim 6, wherein the detector further includes a comparator which compares the determined at
least one color parameter of the analyzed pixel with a plurality of color parameters in nonanalyzed pixels of the video
signal, to identify the plurality of color parameters in nonanalyzed pixels which are identical to the determined
at least one color parameter of the analyzed pixel.

9. The apparatus of claim 6, wherein a coder generates a code segment based on the location of the at least one 
color parameter of the analyzed pixel. 

10. The apparatus of claim 3, wherein the arc associated with the human shape is analyzed to determine whether the
shape contains pixels associated with an eyes-nose-mouth (ENM) region.

11. The apparatus of claim 10, wherein the pixels not associated with the eyes-nose-mouth (ENM) region are excluded
from analysis by the detector.

12. A method for determining skin tone in a video signal, the method comprising the steps of:

analyzing at least a portion of the video signal to identify objects of a desired shape; and
analyzing at least one pixel from at least one of the identified objects of the desired shape to determine whether
the analyzed pixel has a luminance parameter indicative of a skin area.

13. The method of claim 1, wherein the desired shape is a shape that is likely to contain a skin area.

14. The method of claim 13, wherein the desired shape has an arc associated with a human shape.

15. The method of claim 14, wherein the desired shape is elliptical.

16. The method of claim 12, wherein the luminance parameter indicative of the skin area is an alternating current (AC)
signal energy component of the analyzed pixel.

17. The method of claim 12, further comprising the step of sampling the analyzed pixel to determine at least one color
parameter of pixel.

18. The method of claim 17, wherein the at least one color parameter is a chrominance parameter.

19. The method of claim 17, further comprising the step of comparing the determined at least one color parameter of 
the analyzed pixel with a plurality of color parameters in nonanalyzed pixels of the video signal, to identify the 
plurality of color parameters in nonanalyzed pixels which are identical to the determined at least one color param- 
eter of the analyzed pixel. 

20. The method of claim 17, further comprising the step of generating a code segment based on the location of the at 
least one color parameter of the analyzed pixel. 

21. The method of claim 14, further comprising the step of analyzing the arc associated with the human shape to 
determine whether the shape contains pixels associated with an eyes-nose-mouth (ENM) region. 

22. The method of claim 21, wherein the pixels not associated with the eyes-nose-mouth (ENM) region are excluded
from analysis by the detector. 